REMARKS

The specification has been amended to correct minor informalities. Claims 12-16 were 
under consideration in the application. Claims 12, 13 and 16 have been amended. Claims 25-34 
have been added. Accordingly, Claims 12-16 and 25-34 are currently under consideration. 

Support for the amendments may be found throughout the specification and claims as 
originally filed. Specifically, support for the amendments to claim 12 may be found in originally 
filed claim 12(a). 

Support for the amendments to claim 13 may be found in originally filed claims 12 and 13.

Support for the amendments to claim 16 may be found in originally filed claims 12 and 16.

Support for new claim 25 may be found in originally filed claims 12 and 16. 

Support for new claim 26 may be found in originally filed claim 12. In particular, 
support for the hybridization and wash conditions recited in claim 26 may be found at page 21, 
lines 14-17. 

Support for the language "costimulates T cell proliferation when the polypeptide is 
present on a first surface and a molecule that transmits an activating signal via the T cell receptor 
is present on a second, different surface" recited in claims 26, 29, and 32 can be found at least at 
page 3, lines 15-25; at page 10, lines 20-25; and in the Examples of the specification. 

Support for new claim 29 may be found at page 3, lines 26-29 of the specification. 

Support for new claim 32 may be found at page 5, lines 13-17 of the specification. 

Support for new claims 27, 30, and 33 may be found in originally filed claims 12-14. 

Support for new claims 28, 31, and 34 may be found in originally filed claims 12-15. 

No new matter has been added by way of these amendments to the specification and claims.

Cancellation of and/or amendments to the claims should in no way be construed as 
acquiescence to any of the Examiner's rejections and were done solely to expedite prosecution of 
the above-identified application. Applicants reserve the option to further prosecute the same or 
similar claims in the instant or in another patent application(s). 
Domestic Priority

The Examiner objects to the Amendment filed on November 2, 2001 under 35 U.S.C.
§132 because it "introduces new matter into the disclosure." More specifically, the Examiner 
indicates that Applicant is required to cancel the new matter, the incorporation by reference 
of serial number 09/644,934. 

The "Related Applications" section of the application has been amended to remove the 
incorporation-by-reference to the parent application of which the instant application is a 
divisional. Accordingly, Applicants request withdrawal of the foregoing objection. 

Title of the Invention 
The Examiner objects to the title as not being descriptive of the invention to which the 
claims are directed. In response, Applicants have amended the title and therefore request that
the foregoing objection be withdrawn. 

Objection to the Disclosure 

The Examiner objects to the specification as it contains embedded hyperlinks and/or 
other forms of browser-executable code. In response to this objection, the hyperlinks have been
deleted.

The Examiner has indicated that the specification recites trademarks which should be 
capitalized and accompanied with the generic terminology. In response, Applicants have 
amended the specification to correctly note trademarks. 

The Examiner has also requested that the application be reviewed for all possible minor 
errors. In response, the application has been reviewed and amended to correct minor 
informalities. 

Accordingly, Applicants respectfully request withdrawal of the objection. 

Objection to Claim 16 

The Examiner objects to claim 16 because of a typographical error. In response, claim 
16 has been amended herein to recite "amino acids 19-245 of SEQ ID NO:2." Accordingly, 
Applicants respectfully request reconsideration and withdrawal of the objection. 
Rejection of Claims 12-16 Under 35 U.S.C. §112, Second Paragraph
The Examiner rejects claims 12-16 under 35 U.S.C. §112, second paragraph as being 
indefinite for failing to particularly point out and distinctly claim the subject matter which 
applicant regards as the invention. More specifically the Examiner states that "[c]laim 12, and 
dependent claims thereof, are ambiguous in reciting 'stringent conditions' . . . in the absence of a
definition that clearly provides the metes and bounds of this phrase, it is unclear which 
conditions are actually claimed." 

Applicants respectfully traverse the foregoing rejection on the grounds that the ordinarily 
skilled artisan would find the pending claims to be clear and definite based on the teachings in 
Applicants' specification and the general knowledge in the art. However, in the interest of 
expediting prosecution, Applicants have amended the claim referring to stringent hybridization
conditions (new claim 26), as suggested by the Examiner, to recite the hybridization and wash
conditions disclosed on page 21, lines 14-17 of the specification. In view of the foregoing,
Applicants respectfully request reconsideration and withdrawal of the rejection under Section 
112, second paragraph. 

Rejection of Claims 12-16 Under 35 U.S.C. §112, First Paragraph - Written Description
Claims 12-16 have been rejected under 35 U.S.C. §112, first paragraph as containing 
subject matter which was not described in the specification in such a way as to reasonably 
convey to one skilled in the relevant art that the inventors, at the time the application was filed, 
had possession of the claimed invention. The Examiner states that 

Applicant does not appear to have identified which fragments of any particular 
length or over a particular region are essential for the function of the polypeptides 
of SEQ ID NOS:2 or 4. . . Applicant does not appear to have provided a 
representative number of species of allelic variants, nor to have provided a 
description of mutational sites that exist in SEQ ID NOS:1 or 3.

More specifically, the Office Action states,

Applicant does not appear to have established which residues within either the 
full length sequence or the fragments can be changed and still maintain a 
correlative function shared by a genus of nucleic acids having at least 50% 
identity to SEQ ID NOS: 1 or 3 or hybridizing to these sequences under stringent 
conditions. 
Applicants respectfully traverse and submit that there is sufficient written description in
Applicants' specification regarding B7-4 polypeptide fragments to inform a skilled artisan that
Applicants were in possession of the claimed invention at the time the application was filed, as
required by section 112, first paragraph (see M.P.E.P. §2163.02). The sufficiency of a
disclosure in meeting the written description requirement of 35 U.S.C. §112 for claims to a
genus of cDNAs was addressed in the Eli Lilly case in which the Court stated that 

[a] description of a genus of cDNAs may be achieved by means of a recitation of 
a representative number of cDNAs, defined by nucleotide sequence, falling within 
the scope of the genus or a recitation of structural features common to the 
members of the genus, which features constitute a substantial portion of the
genus. 

The Regents of the University of California v. Eli Lilly and Co., 43 USPQ2d 1398, 1406 (Fed.
Cir. 1997). Therefore, as articulated by the Federal Circuit, a claim to a genus of chemical
compounds satisfies the written description requirement when its accompanying specification 
either defines by sequence a representative number of its members falling within the scope of the 
genus or when its accompanying specification defines the structural features common to a 
substantial portion of the genus. 

Furthermore, in Example 15 of the Interim Guidelines for Examination of Patent
Applications Under the 35 U.S.C. §112, First Paragraph Written Description Requirement the
"theoretical specification" discloses a messenger RNA sequence, SEQ ID NO:1, which encodes
a human growth hormone. The "theoretical specification" claims antisense molecules that 
inhibit the production of human growth hormone. The Guidelines provide that: 

[c]onsidering the specification's disclosure of (1) the sequence (SEQ ID NO: 1)
which defines and limits the structure of any effective molecules such that one
skilled in the art would be able to immediately envisage members of the genus
embraced by the claim and 2) the functional characteristics of the claimed
invention as well as a routine art-recognized method of screening for antisense
molecules which provide further distinguishing characteristics of the claimed
invention, along with, 3) the general level of knowledge and skill in the art, one
skilled in the art would conclude that applicant was in possession of the
invention the claimed invention is adequately described. (Emphasis added).
The instant specification satisfies this requirement for the claimed invention because the
claimed genus of B7-4 polypeptide fragments of the present invention is defined by structural
features that are described in the specification, recited in the claims and commonly possessed by 
its members. 

For example, the instant specification describes the nucleotide sequence of the B7-4 
nucleic acid molecules of the invention (SEQ ID NOS: 1 and 3) and the amino acid sequence of 
the B7-4 polypeptides of the invention (SEQ ID NOS:2 and 4) which define and limit the 
structure of any nucleic acid or polypeptide fragments such that one skilled in the art would 
be able to immediately envisage members of the genus embraced by the polypeptide claims. In 
particular, Applicants disclose two novel human B7-4 molecules. One form is a naturally 
occurring B7-4 soluble polypeptide, i.e., having a short hydrophilic domain and no 
transmembrane domain, referred to as B7-4S (shown in SEQ ID NO:2). The other form is a 
cell-associated polypeptide, i.e., having a transmembrane and cytoplasmic domain, referred to as
B7-4M (shown in SEQ ID NO:4). Indeed, the B7-4 protein and nucleic acid molecules comprise 
a family of molecules having certain conserved structural and functional features. For example, 
the B7-4 family of molecules share conserved regions, including signal domains, IgV domains 
and the IgC domains. The instant specification identifies important sites on the B7-4 molecules 
of the present invention, such as the signal sequences of SEQ ID NOS:2 and 4, which are shown 
from amino acids 1-18; the IgV domains of SEQ ID NOS:2 and 4, which are shown from about 
amino acids 19-134; and the IgC domains of SEQ ID NOS:2 and 4, which are shown from about 
amino acids 135-227. In addition, the B7-4 polypeptide exemplified in SEQ ID NO:2
comprises a hydrophilic tail shown from about amino acids 228-245. Moreover, the B7-4
polypeptide exemplified in SEQ ID NO:4 comprises a transmembrane domain shown from 
about amino acids 239-259 of SEQ ID NO:4 and a cytoplasmic domain shown from about amino 
acids 260-290 of SEQ ID NO:4. 

The claims which include variation in the structure of the claimed polypeptide molecule 
include the limitation that the polypeptide molecules referred to therein costimulate T cell 
proliferation in vitro when the polypeptide is present on a first surface and a molecule that 
transmits an activating signal via the T cell receptor is present on a second, different surface. 
This is a readily testable function which is described in the instant application. 
Furthermore, as the Examiner is well aware, the generation of nucleic acid or polypeptide
fragments is routine in the art. For example, (as indicated in Example 15 of the Interim 
Guidelines) any specified fragment can be ordered from a commercial synthesizing service. 

Thus, the pending claims recite the structure and function of the claimed polypeptides in 
such a way as to reasonably convey to one of ordinary skill in the art that the inventors had
possession of the claimed invention at the time of filing. Accordingly, Applicants respectfully 
request reconsideration and withdrawal of the rejection of the pending claims under 35 U.S.C. 
§112, first paragraph. 

Rejection of Claims 12-16 Under 35 U.S.C. §112, First Paragraph - Enablement 

The Examiner rejects claims 12-16 under 35 U.S.C. §112, first paragraph "because the 
specification, while being enabling for SEQ ID NOS:2 or 4, does not reasonably provide 
enablement for the various polypeptides comprising fragments, allelic variants encoded by 
nucleic acids which hybridize with disclosed sequences, or polypeptides which are at least 50% 
identical, or are encoded by nucleic acids which are at least 50% identical, to disclosed 
sequences. . .to make and use the invention commensurate in scope with these claims." More 
specifically, the Examiner states that "Applicant does not appear to have identified which 
polypeptide fragments are essential to the function of a B7-4 protein." 

The test of enablement is not whether any experimentation is necessary, but whether
any required experimentation is undue. In re Angstadt, 537 F.2d 498, 190 USPQ 214 (CCPA
1976); M.P.E.P. §2164.01. As the Examiner is aware, factors to be considered when 
determining whether a disclosure requires undue experimentation include the nature of the 
invention, the state of the prior art, the relative skill of those in the art, the amount of direction or 
guidance disclosed in the specification, the presence or absence of working examples, the 
predictability or unpredictability of the art, the breadth of the claims, and the quantity of 
experimentation which would be required in order to practice the invention as claimed. Ex parte
Forman, 230 USPQ 546 (BPAI 1986).

Applicants respectfully traverse this rejection on the grounds that Applicants' 
specification contains ample guidance as to how one of skill in the art would make and use the 
claimed invention. Specifically, the instant specification describes the nucleotide sequences of 
the B7-4 nucleic acid molecules of the invention (SEQ ID NOS:1 and 3) and the amino acid
sequences of the B7-4 polypeptides of the invention (SEQ ID NOS:2 and 4) which define and
limit the structure of any polypeptide fragments such that one skilled in the art would be able to
immediately envision members of the genus embraced by the polypeptide fragment claims. As
set forth above, the B7-4 family of molecules share a number of conserved regions, including
signal domains, IgV domains and the IgC domains. Based on structural analysis, Applicants
have identified important sites on the molecules of the present invention, including the signal 
sequence (amino acid residues 1-18 of SEQ ID NOS: 2 and 4); the IgV domain (amino acid 
residues 19-134 of SEQ ID NOS: 2 and 4); and the IgC domain (amino acid residues 135-227 of 
SEQ ID NOS:2 and 4). In addition, Applicants have identified other important polypeptide 
fragments, including the hydrophilic tail of the B7-4 (amino acid residues 228-245 of SEQ ID 
NO:2); a transmembrane domain (amino acid residues 239-259 of SEQ ID NO:4); and a 
cytoplasmic domain (amino acid residues 260-290 of SEQ ID NO:4). It is well known in the art
that "non-essential" amino acid residues, e.g., residues that can be altered from the wild-type 
sequence of a B7-4 molecule without altering the functional activity of a B7-4 molecule, can be 
readily identified by one of ordinary skill in the art by performing an amino acid alignment of 
B7 family members and determining residues that are not conserved. 

Moreover, Applicants' specification discloses ample guidance as to how the polypeptide 
fragments of the invention may be generated and used in screening assays, diagnostic assays, 
prognostic assays, and the methods of treatment, e.g., therapeutic and prophylactic, as taught at 
page 58, line 6 through page 77, line 16 of the specification. Thus, one of ordinary skill in the 
art reading the foregoing teachings in Applicants' specification would have been able to use the 
claimed invention using only routine experimentation. 

The Examiner states that "allelic variants can encode proteins having drastically different 
functions, even when the proteins share a high level of sequence and structural homology" and
cites Voet et al. to support this contention. Further, the Examiner is of the opinion that "there
appears to be insufficient guidance in the specification as fled [filed] to direct a person of skill in 
the art to select particular sequences as essential for the functional properties of a polypeptide 
comprising a sequence that has 'at least 50% identity' to the polypeptide of SEQ ID NO:2 or 4, 
or encoded by a nucleic acid which has 'at least 50% identity' to the nucleotide sequence of SEQ 
ID NOS: 1 or 3." The Examiner cites Attwood, Skolnick et al. and Coyle et al. to support the
position that protein structure similarity does not confirm the function of the related proteins. 
Applicants respectfully traverse this rejection. While examples exist of polypeptide
families wherein individual members have distinct, even opposite, biological activities, growing
databases and improved search techniques, particularly the iterated PSI-BLAST tool, have yielded
substantial improvement in secondary structure prediction accuracy. According to Rost, a copy
of which is submitted herewith as Appendix A, "[s]econdary structure predictions are
increasingly becoming the work horse for numerous methods aimed at predicting protein
structure and function." Burkhard Rost, Review: Protein Secondary Structure Prediction
Continues to Rise (2001) J. Structural Biology 134: 204-218. Furthermore, as indicated by Rost,
"[s]tate-of-the-art methods now reach sustained levels of 76% prediction accuracy." In addition, 
the claims have been amended to increase the recited percent identity and to require that the 
polypeptides function to costimulate T cell proliferation in vitro when the polypeptide is present 
on a first surface and a molecule that transmits an activating signal via the T cell receptor is 
present on a second, different surface. 

It is Applicants' position that, based on the teachings in Applicants' specification, one of 
ordinary skill in the art would be able to make and use the claimed invention using only routine 
experimentation. In view of the foregoing, Applicants respectfully request reconsideration and
withdrawal of the rejection. 

Rejection of Claim 12 Under 35 U.S.C. §102(b) 

The Examiner rejects claim 12 under 35 U.S.C. § 102(b) as being anticipated by GenBank 
entry Accession #AA292201. The Office Action indicates that GenBank Accession #AA292201
from about residues 9-497 is identical to SEQ ID NO:1 from about residues 320-807. Moreover,
the Examiner indicates that #AA292201 from about residues 9-430 is identical to SEQ ID NO:3
from about residues 315-734. Thus, according to the Examiner, "AA292201 meets the
limitation of a nucleic acid molecule which encodes a 15 amino acid contiguous fragment of
SEQ ID NOS:2 or 4. Likewise, #AA292201 encodes a polypeptide [that] is at least about 50%
identical to the amino acid sequence of either SEQ ID NO:2 or 4, or which is encoded by a
nucleic acid which is at least 50% identical to SEQ ID NO:1 or 3, since no limitation requiring
the identity to be over the full length of the polypeptide is recited." 

The rejection of claims 12-16 is respectfully traversed. Claims 12-15 are directed to
isolated polypeptides. AA292201 is merely a nucleic acid sequence, and does not teach a
polypeptide. The AA292201 sequence, obtained from the sequencing of expressed sequence
tags (EST's), does not contain any coding information. For example, no open reading frames,
start sites of translation, or conclusive identification of the occurrence of a specific translated 
protein product is taught by the reference. As such, the cited reference does not disclose the 
isolated polypeptide specified in claims 12-15. Applicants additionally point out that claim 14 
specifies that the polypeptide further comprises heterologous amino acid sequences which are 
derived from an immunoglobulin molecule. The reference fails to disclose such amino acid 
sequences. 

Applicants further argue that the invention of the isolated polypeptide of the pending 
claims is not obvious over the cited art. Although conceptual translation of fragments of the 
nucleotide sequence of the EST's might produce fragments of the B7-4 polypeptide, prior to the
immediate disclosure, no guidance existed in the art with respect to which conceptual
translations of the EST's were biologically relevant. Prior to Applicants' findings, the B7-4
polypeptides of the pending claims were not known to exist in nature, and the function of the
polypeptide was not known. Thus, the claims of the invention which are directed to isolated 
polypeptides of B7-4, or fragments or variants thereof, are novel and non-obvious in view of the 
AA292201 sequence. 
SUMMARY

In view of the above amendment, Applicants believe the pending application is in
condition for allowance. 

Applicants believe no additional fees to be due at this time. However, if additional fees 
are due, please charge our Deposit Account No. 12-0080, under Order No. GNN-004ADV from
which the undersigned is authorized to draw. 



Dated: December 29, 2004 



Respectfully submitted, 




By 
Hathd 

Registration No.: 46,488
LAHIVE & COCKFIELD, LLP 
28 State Street 

Boston, Massachusetts 02109 
(617) 227-7400 
(617) 742-4214 (Fax) 
Attorney/Agent For Applicants 
Methods predicting protein secondary structure improved substantially in the
1990s through the use of evolutionary information taken from the divergence of
proteins in the same structural family. Recently, the evolutionary information
resulting from improved searches and larger databases has again boosted
prediction accuracy by more than four percentage points to its current height
of around 76% of all residues predicted correctly in one of the three states,
helix, strand, and other. The past year also brought successful new concepts
to the field. These new methods may be particularly interesting in light of the
improvements achieved through simple combining of existing methods. Divergent
evolutionary profiles contain enough information not only to substantially
improve prediction accuracy, but also to correctly predict long stretches of
identical residues observed in alternative secondary structure states depending
on nonlocal conditions. An example is a method automatically identifying
structural switches and thus finding a remarkable connection between predicted
secondary structure and aspects of function. Secondary structure predictions
are increasingly becoming the work horse for numerous methods aimed at
predicting protein structure and function. Is the recent increase in accuracy
significant enough to make predictions even more useful? Because the recent
improvement yields a better prediction of segments, and in particular of
β strands, I believe the answer is affirmative. What is the limit of prediction
accuracy? We shall see.



INTRODUCTION

History. Linus Pauling correctly guessed the formation of helices and strands
(14, 15) (and falsely hypothesized other structures). Three years before
Pauling's guess was verified by the publications of the first X-ray structures
(16, 17), one group had already ventured to predict secondary structure from
sequence (16). The first-generation prediction methods following in the 1960s
and 1970s were all based on single amino acid propensities (19). The
second-generation methods dominating the scene until the early 1990s used
propensities for segments of 3-51 adjacent residues (19). Basically any
imaginable theoretical algorithm had been applied to the problem of predicting
secondary structure from sequence. However, it seemed that prediction accuracy
stalled at levels slightly above 60% (percentage of residues predicted
correctly in one of the three states: helix, strand, and other). The reason for
this limit was the restriction to local information. Can we introduce some
global information into local stretches of residues?
Secondary structure prediction profits from divergence. Early on, Dickerson
et al. (20) realized that information contained in multiple alignments can
improve predictions. Zvelebil et al. (21) incorporated this concept into an
automatic prediction method. However, the breakthrough of the third-generation
methods to levels above 70% accuracy required a combination of larger databases
with more advanced algorithms (19, 22). The major component of these new
methods was the use of evolutionary information. All naturally evolved proteins
with more than 35% pairwise identical residues over more than 100 aligned
residues have similar structures (23). This seemingly implies an amazing
stability of structure with respect to sequence divergence. However, this
average figure hides the fact that neutral mutations are extremely unlikely.
Supposedly most mutations result in proteins that will not adopt any globular
structure at all. In other words, only a tiny fraction of all possible proteins
exist. Hence, position-specific profiles describing which residues can be
exchanged against which others at which positions contain crucial information
about protein structure. One consequence is that stretches of 17 adjacent
residues implicitly contain some information about long-range interactions and
environment since the profile reflects evolutionary constraints. Using
evolutionary divergence was the start key to the third-generation prediction
methods. Knowing 3D structure, we can identify very distant relationships
between proteins that would improve accuracy even further (24). Can we build
larger and more diverged families without knowing structure?

FIG. 1. Profile-based searches extend evolutionary information. The cloud signifies a protein structural family for the query protein
U, i.e., all proteins that have a similar 3D structure. A simple pairwise comparison of U with all other proteins covers the "safe zone" of
sequence alignment [...] of U. For example, PSI-BLAST starts the next iteration with the family-specific profile given by the proteins found
in the safe zone. Searching the database again with this profile reaches safely into the twilight zone (zone reached marked by double-lined
egg indicated in figure). However, no current method generally reaches all members of family U. Furthermore, in particular for PSI-BLAST the
new region may fall outside of the initial safe zone (black subregion of the safe zone). Finally, the regions that could have been reached by
sequence-space hopping or intermediate sequence searches (dashed circles around five initial hits; (120, 121)) are not entirely covered by
the profile-based search. The tricky bit is to avoid the possibility that the profile will pick unrelated proteins (transparent egg) and thus
connect two separate structural families (U and X). Conclusions: (i) Iterated PSI-BLAST searches can safely identify fairly divergent
family members. (ii) Close homologues may be lost during the extension of the family. (iii) The advanced search can lead the results astray.



Abbreviations used: 3D structure, three-dimensional (coordinates of protein
structure); 1D structure, one-dimensional (e.g., sequence or string of
secondary structure); ASP, method identifying regions of structure ambivalent
in response to global changes (1); DSSP, database and method converting 3D
coordinates into secondary structure (2); HMMSTR, hidden Markov model-based
prediction of secondary structure (3); JPred, method combining other
prediction methods (4, 5); JPred2, divergent profile (PSI-BLAST)-based neural
network prediction (6); PHD, simple profile-based neural network prediction
(7); PHDpsi, divergent profile (PSI-BLAST)-based neural network prediction
(8); PROF, divergent profile-based neural network prediction trained and
tested with PSI-BLAST (9); PSI-BLAST, position-specific profile-based, fast
and accurate alignment method (10); PSIPRED, divergent profile
(PSI-BLAST)-based neural network prediction (11); [...]


New database searches extend family divergence. It was also recognized very
early on that information from the position-specific evolutionary exchange
profile of a particular protein family facilitates discovering more distant
members of that family (20). Automatic database search methods successfully
used position-specific profiles for searching (25). However, the breakthrough
for large-scale routine searches was achieved with the development of
PSI-BLAST (10) and hidden Markov models (12, 26). In particular, the gapped,
profile-based, and iterated search tool PSI-BLAST continues to revolutionize
the field of protein sequence analysis through its unique combination of speed
and accuracy. More distant relationships are found through iteration starting
from the safe zone of comparisons and intruding deeply and reliably into the
twilight zone (Fig. 1).

Topics left out here. This review focuses on methods predicting secondary
structure for globular proteins, in general. At the infancy of analyzing the
proteome of entirely sequenced organisms, the most useful structure prediction
methods are those that focus on particular classes of proteins, such as
proteins containing membrane helices and coiled-coil regions (27-30). For
predicting the topology of helical membrane proteins, a number of new methods
add interesting new facets (31-36). However, no method has truly used the
flood of recent experimental information about membrane proteins (37).
Overall, membrane helices can be predicted much more accurately than globular
helices. The current state of the art is to correctly predict all membrane
helix topology for more than 80% of the proteins and to falsely predict
membrane helices for less than 4% of all globular proteins. We have recently
come across evidence suggesting that this figure overestimates performance
(Rost, unpublished). Clearly, methods developed to predict helices in globular
proteins go completely wrong for membrane helices! In contrast, porins appear
to be predicted relatively accurately by methods developed for globular
proteins (38, 39). Few methods specifically predicting coiled-coil regions
have been published recently (older review in (40)). Two interesting
developments are the prediction of the dimeric state of coiled-coils (41) and
a method predicting 3D structure for coiled-coil regions (42). In fact, the
latter is the only existing method predicting 3D structure below 2-Å main
chain deviation over more than 30 residues. Another example of successful
specialized secondary structure prediction methods is the focus on β turns
(43, 44). The method from the Thornton group appears to be the most accurate
current means of predicting turns. Successful methods specialized in
predicting α-helix propensities have resulted from the experimental studies of
short peptides in solution (45, 46). Neither the turn nor the
helix-in-solution methods have yet been combined with other secondary
structure prediction methods.
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Jones broke through by using PSI'-BLAST 
searches of large databases. David Jonea pioneered 
tiiie use of iterated PSI-BLAST searches automati- 
cally (11). The most important stqp achieved by the 
resulting method PSIFBED has been the detailed 
Strategy of avQidiing p<Auti0n of t^ 
imrelntpd protems (Fig. 1). To avdd this trap, the 
database searched must be filtered first (11), 
CAiBaP intu Mw%ft at whidi David Jones introduced 
PtUPijuai)» Kevin Karplna and colleagues presented 
fhdr pre^ctioa method (SAM-T993e^, finding more 
diverged profiles Unoaijh hidden Markov models 
(47,48)«Beoenl3y, Ooff and Barton also successfully 
need FSUSLAST aHgnnHwiftB finr JPred2 (see 49). 



Jennings etoL (50) explore an alternative to increas- 
ing divergence: they started with a safe zone align- 
ment thruu£^ ClustalW (51) and HMMer (26) and 
iteratively refined the alignment using the second- 
ary structure prediction from DSC (52). The result- 
ing aligmnent is reported to be more accurate and to 
yield higher prediction accuracy than the initial 
ClustalW/HMMer alignments (50). Bow accinrate is 
secondary structure prediction in 2000? 

Prediction accuracy peaks at 76% accuracy. The current best methods
reach a level of 76% three-state per-residue accuracy (Table I). This
constitutes a sustained level more than four percentage points above
the last century's best method not using diverged profiles (PHD in
Table I). Fortunately, the improvement is valid for helix, strand, and
nonregular regions (information and correlation indices in Table I).
Furthermore, significantly fewer residues are confused between the
states helix and strand (BAD score, Table I). Finally, some new methods
also improve in a more global sense by improving the accuracy of
assigning the secondary structural class (all-alpha, all-beta,
alpha/beta, and other) based on the predicted content of regular
secondary structure (Class score, Table I).
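
The scores named above can be made concrete with a short sketch. It
assumes that observed and predicted secondary structure are given as
equal-length strings over the letters H (helix), E (strand), and L
(other); the function names are illustrative and not taken from any of
the methods reviewed. The BAD score is normalized here by the total
number of residues, which is an assumption, since the normalization is
not spelled out in the text.

```python
from math import sqrt


def q3(observed: str, predicted: str) -> float:
    """Three-state per-residue accuracy in percent."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("strings must be non-empty and of equal length")
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)


def bad(observed: str, predicted: str) -> float:
    """Percent of residues confused between helix and strand.

    Assumption: normalized by all residues of the protein.
    """
    confused = sum(
        (o, p) in (("H", "E"), ("E", "H")) for o, p in zip(observed, predicted)
    )
    return 100.0 * confused / len(observed)


def matthews(observed: str, predicted: str, state: str) -> float:
    """Matthews correlation coefficient for one state (H, E, or L)."""
    tp = fp = tn = fn = 0
    for o, p in zip(observed, predicted):
        if p == state and o == state:
            tp += 1
        elif p == state:
            fp += 1
        elif o == state:
            fn += 1
        else:
            tn += 1
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Return 0 when the coefficient is undefined (a row or column is empty).
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

For a single protein these numbers are noisy; averages over many
proteins are what Table I reports.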

Sources of improvement: Four parts database growth, three parts
extended search, two parts other. Jones solicited two causes for the
improved accuracy: (i) training and (ii) testing the method on
PSI-BLAST profiles. Cuff and Barton examined in detail how different
alignment methods improve (6). However, which fraction of the
improvement results from the mere growth of the database, which
fraction results from using more diverged profiles, and which fraction
results from training on larger profiles? Using PHD from 1994 to
separate the effects (8), we first compared a noniterative standard
BLAST (53) search against SWISS-PROT (54) with one against
SWISS-PROT + TrEMBL (54) + PDB (55). The larger database improves
performance by about two percentage points (8). Second, we compared the
standard BLAST against the large database with an iterative PSI-BLAST
search. This yielded less than two percentage points in additional
improvement (8). Thus, overall, the more divergent profile search
against today's databases supposedly improves any method using
alignment information by almost four percentage points (PHDpsi in
Table I). The improvement gained by using PSI-BLAST profiles to develop
the method is relatively small: PHDpsi was trained on a small database
of not very divergent profiles in 1994; e.g., PROF was trained on
PSI-BLAST profiles of a 20 times larger database in 2000. The two
differ by only one percentage point (Table I);

TABLE I

Accuracy of Secondary Structure Prediction Methods (a)

Method (b)   Q3 (c)  Claim (d)      SOV (e)  Info   CorrH (f)  CorrE (g)  CorrL (h)  Class (i)  BAD (j)
PROF         77.0                   73       0.37   0.67       0.65       0.?6       82         2.2
PSIPRED      76.6    76.5-78.3 (k)  73       0.37   0.66       0.64       0.66       81         2.6
SSpro        76.?    76             71       0.3?   0.67       0.64       0.?6       83         2.5
JPred2       75.?    76.4           70       0.34   0.65       0.63       0.54       77         2.4
PHDpsi       75.1                   70       0.29   0.64       0.62       0.53       80         2.9
PHD          71.9    71.6           68       0.25   0.59       0.59       0.49       77         4.1
Copenhagen   77.8
Wang/Yuan                                                                            78 (l)

(a) Data set and sorting: The results are compiled by EVA (58). All
methods for which details are listed have been tested on 195 different
new protein structures (EVA version February 2001). None of these
proteins was similar to any protein used to develop the respective
method [...] by February 1, 2001, for which we had results. Sorting and
grouping reflect the following concept: if the data set is too small to
distinguish between two methods, these two are grouped. For the given
set of 195 proteins, this yielded three groups. Inside of each group,
results are sorted alphabetically. Due to a lack of data, I could not
add the performance of SAM-T99sec (48); on a set of 105 proteins
SAM-T99sec appears comparable to the best three methods: PSIPRED,
SSpro, and PROF. The results from the Copenhagen method are set apart,
since they were not collected continuously by EVA (the method is not
publicly available); rather they were provided by the group in Denmark
for this review and thus may have been based on marginally differing
data sets.

(b) Abbreviations: see footnote in text; Copenhagen refers to the
method from the group in Denmark (63); Wang/Yuan refers to a method
predicting secondary structural class from the amino acid composition,
which may be the most accurate such method (59).

(c) Three-state per-residue accuracy, i.e., number of residues
predicted correctly in one of the three states, helix, strand, or other
(conversion of DSSP states: (HG) helix, (EB) strand; note that the
per-residue accuracy tends to favour methods overpredicting [...]
structure).

(d) Three-state per-residue accuracy published in original publication
of method: PSIPRED (11), SSpro (13), JPred2 (6), PHD (122).

(e) Three-state per-segment score measuring the overlap between
predicted and observed segments (75, 123).

(f) Matthews correlation coefficient for state helix.

(g) Matthews correlation coefficient for state strand (124).

(h) Matthews correlation coefficient for state other (124).

(i) Percentage of proteins correctly sorted into one of the four
classes: all-alpha (length > 60, helix >45%, strand <5%), all-beta
(length > 60, helix <5%, strand >45%), alpha/beta (length > 60, helix
>30%, strand >20%), other (thresholds for classification from
(122, 125, 126)).

(j) Percentage of helical residues predicted as strand and of strand
residues predicted as helix (127).

(k) PSIPRED results were published for different conversions of the
eight DSSP states to three states.

(l) The class accuracy for the method based on amino acid composition
is taken from the original publication (69), i.e., based on a different
data set than all other methods.
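
The conversion of DSSP states and the class thresholds given in the
footnotes to Table I can be sketched as follows. This is a minimal
illustration, not the code used by EVA. It assumes DSSP strings as
input, maps H and G to helix and E and B to strand as stated in the
footnote (all other DSSP symbols become L, "other"), and, as a further
assumption, assigns proteins of 60 residues or fewer to "other".

```python
# Mapping taken from the footnote: (HG) helix, (EB) strand, rest other.
DSSP_TO_THREE = {"H": "H", "G": "H", "E": "E", "B": "E"}


def to_three_states(dssp: str) -> str:
    """Convert a DSSP string to the three states H, E, L."""
    return "".join(DSSP_TO_THREE.get(symbol, "L") for symbol in dssp)


def structural_class(three_state: str) -> str:
    """Assign the secondary structural class from helix/strand content."""
    length = len(three_state)
    if length <= 60:
        # Assumption: the footnote gives no class for short proteins.
        return "other"
    helix = 100.0 * three_state.count("H") / length
    strand = 100.0 * three_state.count("E") / length
    if helix > 45 and strand < 5:
        return "all-alpha"
    if helix < 5 and strand > 45:
        return "all-beta"
    if helix > 30 and strand > 20:
        return "alpha/beta"
    return "other"


def class_score(observed: list[str], predicted: list[str]) -> float:
    """Percent of proteins whose predicted class equals the observed one."""
    hits = sum(
        structural_class(o) == structural_class(p)
        for o, p in zip(observed, predicted)
    )
    return 100.0 * hits / len(observed)
```

Because the thresholds are sharp, a small change in predicted content
can move a protein from one class to another.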



this difference resulted from implementing new concepts into PROF
(Rost, unpublished; 9).

CAUTION: OVEROPTIMISM HAS BECOME EVEN MORE LIKELY*

Seemingly improving accuracy by ignoring short segments. There are many
ways to publish higher levels of accuracy. Among the simplest for
secondary structure prediction is to convert 3-10 helices and β bulges
assigned by DSSP (2) to nonregular structure. This yields higher levels
of accuracy since all methods are, on average, better at predicting the
middle of helices and strands than their caps and hence are more
accurate for longer regular secondary structure segments (56, 57). When
predicted secondary structure is used to predict 3D structure, short
helices are important. Thus, I suggest bearing with the more
conservative conversion strategy.

*Note: I added this section [...]

Comparing apples and oranges or too few apples with one another. To
overstate the point: there is NO value in comparing methods evaluated
on different data sets. Most secondary structure prediction methods are
available. Thus, developers may want to compare their results to public
methods based on the same data set (not previously used for either of
the two). Many methods predicting aspects of protein structure and
function must fight with limited data availability. This is not at all
the case for secondary structure prediction. Hundreds of new protein
structures are added every year (55). If for some reason or another,
small data sets must be used, developers should painstakingly try to
estimate what "significant difference" means for their data set. For
example, 16 new protein structures are clearly too few. We currently
have results from many prediction methods for 16 proteins. For that
set, JPred2, PHD, PROF, PSIPRED, SAM-T99sec, and SSpro are
indistinguishable (58)!
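
Why 16 proteins are too few can be illustrated with the standard error
of the mean. The sketch below assumes that per-protein accuracies
scatter with a standard deviation of about 10 percentage points (the
order of magnitude quoted in the caption of Fig. 2) and that proteins
are independent; it is a rule of thumb, not the grouping procedure used
by EVA, and the function names are illustrative.

```python
from math import sqrt
from statistics import stdev


def standard_error(std_dev: float, n_proteins: int) -> float:
    """Standard error of the mean accuracy over n_proteins proteins."""
    if n_proteins < 1:
        raise ValueError("need at least one protein")
    return std_dev / sqrt(n_proteins)


def standard_error_from_scores(per_protein_q3: list[float]) -> float:
    """Same quantity, estimated from observed per-protein accuracies."""
    return stdev(per_protein_q3) / sqrt(len(per_protein_q3))


if __name__ == "__main__":
    for n in (16, 195):
        print(n, round(standard_error(10.0, n), 2))
```

Under these assumptions the mean accuracy over 16 proteins is uncertain
by about 2.5 percentage points, whereas 195 proteins reduce this to
roughly 0.7 points, so differences of one or two points cannot be
resolved on the smaller set.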

Seemingly achieve 100% accuracy by using correlated sets. Many
publications on predicting secondary structural class from amino acid
composition allowed correlations between "training" and testing sets.
Consequently, levels of prediction accuracy published far exceeded the
possible theoretical margins (59). A very simple operational definition
for "dependent sets" is the following: Two proteins A and B are
correlated if the sequence similarity between A and B suffices to
predict the structure of B knowing A's structure. Assume we have two
uncorrelated sets of proteins S1 and S2. Can we train the method on set
S1 and develop it on set S2 without further ado? While developing PROF,
I realized that the answer is negative. In fact, I trained neural
networks on about 2000 structures that had no significant level of
sequence similarity to our original set of 126 proteins (22). I used
the 126 proteins only after I had completed developing the method and
found a prediction accuracy exceeding 80% (unpublished). When I tested
PROF on a set of about [...] new structures that had been added to PDB
in the meantime (different from that given in Table I), prediction
accuracy dropped. Do the 126 proteins differ from the set used for
Table I? I failed to answer this question. Conclusion: test as test
can; i.e., use as many independent sets of new structures as possible!

EVA: Automatic evaluation of automatic prediction servers. In
collaboration with Volker Eyrich (Columbia), Marc Marti-Renom and
Andrej Sali (both from Rockefeller), and Florencio Pazos and Alfonso
Valencia (both from CNB Madrid), we have started to address the above
problems through the automatic server EVA (58). Leszek Rychlewski
(IIMCB Warsaw) and Dani Fischer (Ben-Gurion University) are
implementing similar ideas in LiveBench (60). The simple concept is the
following: Take the N newest experimental structures added to PDB, send
the sequences to all prediction servers, collect the results, and
accumulate a continuous evaluation of prediction accuracy every week.
EVA has been evaluating secondary structure prediction methods for more
than 6 months now. I found it instructive to see how the "ranking" of
methods substantially changed from week to week due to too small sets.
Currently, EVA also provides results for evaluating comparative
modeling (Sali group) and [...]. EVA will eventually simplify [...]

SSpro: Advanced recursive neural network system. The only method
published recently that appears to improve prediction accuracy
significantly not through more divergent profiles but through the
particular algorithm is SSpro (13). The major idea of the method aims
at solving the following problem. When, e.g., training neural networks
it is important to avoid correlations between training samples
presented successively to the system. A neural network may be presented
with the window around residue 11 in protein X at time step T and
residue 7 in protein Y at step T + 1. Thus, the system never learns
that secondary structure correlates between adjacent residues. The
result is that regular secondary structure segments are predicted, on
average, at a length half that observed (19). PHD addressed this
problem by a second-level structure-to-structure network that was
trained on the predicted secondary structure from the first-level
sequence-to-structure network (22). Most authors have since implemented
this idea (in particular PSIPRED and JPred2). Pierre Baldi and
colleagues deviated substantially from this concept. Instead of using
an additional network, they embedded the correlation into one single
recursive neural network. In principle, the idea of a recursive network
had been implemented before (61). However, the particular details of
the algorithm implemented in SSpro are novel and, as Table I
illustrates, prove highly successful.

HMMSTR: Hidden Markov models for connecting library of structure
fragments. Can we predict secondary structure for protein U by local
sequence similarity to segments of known structures {S} even when
overall U differs from any of the known structures {S}? Yes, as shown
by many nearest-neighbor-based prediction methods, the most successful
of which seems to be NSSP (62). A conceptually quite different
realization of the same concept has been implemented in HMMSTR by Chris
Bystroff, David Baker, and colleagues (3). First, build a library of
local stretches (3-19) of residues with "basic structural motifs"
(I-sites). Second, assemble these local motifs through hidden Markov
models introducing structural context on the level of supersecondary
structure. Thus, the goal is to predict protein structure through
identification of "grammatical units of protein structure formation."
Although HMMSTR intrinsically aims at predicting higher order aspects
of 3D structure, a side result is the prediction of secondary
structure. I find two results surprising. (i) The authors do not find
any significant effect of "overoptimizing" their method; i.e., HMMSTR
appears as accurate in predicting secondary structure for proteins
known today as it will be for those known next year. (ii) Three-state
per-residue accuracy is reported to be about 74% (3). If this estimate
is correct, HMMSTR is more accurate at predicting secondary structure
than most existing methods and almost as accurate as the
state-of-the-art methods (Table I).

And the winner is? The reason for the particular focus of this review
on a small number of methods is largely that I could compare the
selected methods to one another based on new proteins. A particular
method that was not available to me may turn out to mark the most
substantial breakthrough in the field. A Danish group developed a
neural network-based method that is most amazing in many respects (63).
(i) The authors estimate the method to yield levels above 77%
prediction accuracy (the title of their article is slightly
misleading). If true, this is the best current method. Like PSIPRED,
JPred2, and PROF, the method uses PSI-BLAST profiles as input and, like
most methods since PHD, a two-level approach addressing the problem of
predicting short segments. (ii) A concept that had not been published
before is to replace the standard three output units (for helix,
strand, and other) by nine output units additionally coding for the
secondary structure states of the residues before and after the central
one (dubbed "output expansion"). (iii) Also new is the particular way
of weighting the average over different networks by the overall
reliability of the prediction for that network and the mere number of
different networks considered (up to 800!). This impressive number of
networks may prevent large-scale genome analyses based on this method.
However, the major point is: Did the authors overestimate performance?
The authors tested their method in a way that most developers would
assume to be error-proof. However, their testing protocol is very
similar to the one that I applied when significantly overestimating the
accuracy of PROF (>81%). Obviously, the similarity of these two
situations may very well be purely coincidental!
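
The "output expansion" idea described under (ii) can be illustrated by
how the training targets might be encoded. The sketch below is not
taken from the Danish group's publication: the ordering of the nine
units (previous, central, next residue) and the treatment of chain ends
as "other" are assumptions made for illustration only.

```python
STATES = "HEL"  # helix, strand, other


def one_hot(state: str) -> list[int]:
    """Three units, one per state."""
    return [1 if s == state else 0 for s in STATES]


def expanded_targets(three_state: str, pad: str = "L") -> list[list[int]]:
    """Nine target values per residue: states of residues i-1, i, i+1.

    Assumption: positions beyond the chain ends are encoded as `pad`.
    """
    padded = pad + three_state + pad
    targets = []
    for i in range(1, len(three_state) + 1):
        targets.append(
            one_hot(padded[i - 1]) + one_hot(padded[i]) + one_hot(padded[i + 1])
        )
    return targets
```

With such targets, a network is rewarded for predicting the
neighborhood of a residue consistently, not only the central state.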

Plethora of new concepts for secondary structure prediction. The
following five methods are a small subset of new ideas explored to
improve secondary structure prediction. (i) Ouali and King (64) combine
neural networks and rule-based statistics in a cascade of classifiers.
Based on a similar data set they estimate a level of prediction
accuracy comparable to that of JPred2 (see Table I). (ii) Chandonia and
Karplus (57) combined simplified output schemes (two output states)
with networks trained on different tasks and a particular variant of
[...]. Input is nondivergent alignments picked from the safe zone
(Fig. 1). Based on a protocol similar to that applied by the Danish
group (63), the authors estimate a level of >76% accuracy, i.e., a
level that if it holds up is similar to SSpro (Table I). (iii)
Supposedly the simplest new method that claims to almost approach the
performance of PHD combines the information for secondary structure
formation contained in amino acid singlets, doublets, and triplets.
(iv) Schmidler et al. (65) use a simple statistical model; the novel
aspect is to replace compiling statistics over fixed stretches of N
residues by segments signifying regular secondary structure (helix,
strand). The underlying formalism resembles a hidden semi-Markov model
allowing one to explicitly incorporate particular propensities such as
helix caps (66). Based on noncomparable data sets the authors estimated
prediction accuracy to be 69%; if correct, this is impressive for a
method not using alignment information. (v) Without claims to
surprising levels of accuracy, Figureau et al. (67) combine cleverly
chosen pentapeptides from the database to obtain the final prediction.

Secondary structural class predicted almost as accurately as by
experiment. Grouping proteins into secondary structure classes
(all-alpha, all-beta, alpha/beta, and other) appears to be a useful
initial approach for classifying proteins (27, 68). Surprisingly, such
classes can be predicted successfully based merely on the overall amino
acid composition of a protein (59, 69, 70). Increasingly complex and
ingenious methods address this reduced goal; reported levels of
prediction accuracy approach 100%. Recently, Wang and Yuan explained
these high values by insufficient testing schemes and argued that a
four-state accuracy of 60% comprises the maximum for methods based
solely on composition (59). Obviously, it is much easier to predict
class starting from the detailed information about evolutionary
profiles for the entire sequence than by restricting the input to
composition. In fact, the best current methods also improve the
accuracy in predicting secondary structure class considerably
(Table I). The differences between observed and predicted composition
of secondary structure are now below 6% for helix and strand. This is
fairly close to what experimental low-resolution (circular dichroism,
Fourier transform infrared spectroscopy) methods achieve at their best
(67).

COMBINING MEDIOCRE AND GOOD METHODS MAY BE BEST

Combination improves on nonsystematic errors. Any prediction method has
two sources of errors: (i) systematic errors, e.g., through nonlocal
effects [...], and (ii) white noise errors caused by, e.g., [...]
training neural networks. Theoretically, combining any number of
methods improves accuracy as long as the errors of the individual
methods are mutually independent and are not only systematic (71). PHD,
and more recently other methods (6, 57, 63), used this fact in
combining different neural networks. The idea of combining different
prediction methods has been around in secondary structure prediction
for a long time (19); Cuff and Barton (see 4, 5) implemented it in
JPred for different third-generation methods. In particular, JPred uses
a simple expert rule for compiling the final average. King et al. (72)
have tested a variety of different combination strategies. Selbig et
al. (73) have compiled the jury through an elaborated
decision-tree-based system. Guermeur et al. (74) have used a more
refined variant of the JPred idea of weighting methods. Overall,
combinations of independent prediction methods seem to yield levels of
accuracy higher than that of the single best method. However, for every
protein one method tends to be clearly superior to the combined
prediction (Fig. 2B). Is it really wise to include significantly
inferior methods into a combined prediction? No: averaging over all
methods used for EVA decreased accuracy over the best individual
methods, although averaging over the better ones was better than
averaging the best ones (Rost, unpublished results). Is there any
criterion for when to include a method and when not to do so? Concepts
weighting the individual methods based on their accuracy and "entropy"
(63) appear successful only for large numbers of methods (63; Rost,
unpublished results). Nevertheless, methods that are significantly
overtrained can improve when combined (Krogh, unpublished results).
More rigorous studies for the optimal combination may provide a better
picture. The technical problem of utilizing many methods in a public
server is that the field is advancing too fast: today's methods are
more accurate than averages over yesterday's methods (hence the JPred
server now returns JPred2 results by default).
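
The simplest form of the combination discussed above is a (weighted)
average over the per-residue outputs of several methods, followed by a
winner-takes-all decision. The sketch below assumes that each method
returns, for every residue, three values for helix, strand, and other;
it is a generic illustration and does not reproduce the expert rule of
JPred or the weighting scheme of any method cited. The weights are a
hypothetical parameter.

```python
from typing import Optional, Sequence

STATES = "HEL"  # helix, strand, other

Profile = Sequence[Sequence[float]]  # per residue: (pH, pE, pL)


def combine(predictions: Sequence[Profile],
            weights: Optional[Sequence[float]] = None) -> str:
    """Average the outputs of several methods and pick the best state."""
    if not predictions:
        raise ValueError("need at least one method")
    if weights is None:
        weights = [1.0] * len(predictions)
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all methods must cover the same residues")
    total = sum(weights)
    combined = []
    for i in range(length):
        averaged = [
            sum(w * p[i][k] for w, p in zip(weights, predictions)) / total
            for k in range(len(STATES))
        ]
        # Winner takes all; ties go to the first state in STATES.
        combined.append(STATES[averaged.index(max(averaged))])
    return "".join(combined)
```

Averaging reduces independent noise but cannot remove errors that all
methods share, which is the limitation noted in the text.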

WHAT DOES 76% ACCURACY MEAN, IN PRACTICE?

Your protein may be predicted worse or better than average. A few
problems in estimating expected prediction accuracy are described
above. However, another problem is relevant for users of prediction
methods: A sustained level of 76% accuracy does NOT mean that 76% of
the residues in your protein of unknown structure U are correctly
predicted. In contrast, prediction accuracy varies substantially
between proteins (Fig. 2). It seems that such variations are intrinsic
to any method predicting aspects of protein structure and function.
What can you expect as accuracy for your protein when using a
state-of-the-art method? Given a divergent family (Table II), the
answer is 66-86%. Do you learn from comparing different methods?

Combining methods improves on average but you may also lose. Averaging
over many methods helps, on average. However, most often some methods
are more accurate than the average (Fig. 2B). Furthermore, there are
examples of proteins predicted poorly by all methods (Fig. 2B), i.e.,
for which all methods agree by mistake (data not shown). Thus, trying
to use many methods may not provide the answer to the question whether
the prediction for your protein is more likely to be below or above
average. Are there alternative ways to spot more reliably predicted
regions?

More reliable predictions are more accurate. Reliability indices as
provided by most methods correlate very well with prediction accuracy
(Fig. 3). This implies that you can easily identify regions that are
more likely to be predicted accurately than others. Furthermore, if
your protein has many residues predicted at low levels of reliability,
you may correctly suspect that your protein is predicted at a level
below average. Plotting coverage versus accuracy (Fig. 3) also
illustrates how beneficial more divergent profiles are to make
predictions more useful. For example, PSIPRED has more than half of all
residues predicted at levels that would be reached on average when
comparing two known structures (75) (Fig. 3, dotted line).
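
A coverage-versus-accuracy curve of the kind described above can be
computed in a few lines. The sketch assumes that a method reports one
numeric reliability index per residue (higher meaning more reliable;
the integer scale used in the example is an assumption) together with
its three-state prediction; the function name is illustrative.

```python
def coverage_accuracy(observed: str, predicted: str,
                      reliability: list[int]) -> list[tuple[int, float, float]]:
    """Return (threshold, coverage %, accuracy %) per reliability cutoff.

    For each cutoff, only residues with reliability >= cutoff are kept.
    """
    if not (len(observed) == len(predicted) == len(reliability)):
        raise ValueError("inputs must have equal length")
    n = len(observed)
    curve = []
    for threshold in sorted(set(reliability), reverse=True):
        selected = [
            o == p
            for o, p, r in zip(observed, predicted, reliability)
            if r >= threshold
        ]
        coverage = 100.0 * len(selected) / n
        accuracy = 100.0 * sum(selected) / len(selected)
        curve.append((threshold, coverage, accuracy))
    return curve
```

Reading the curve from high to low thresholds shows how much accuracy
is traded for covering more residues.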

ARE SECONDARY STRUCTURE PREDICTIONS USEFUL, IN PRACTICE?

Regions likely to undergo structural change predicted successfully.
Young et al. (1) have unraveled an impressive correlation between local
secondary structure predictions and global conditions. The authors
monitor regions for which secondary structure prediction methods give
equally strong preferences for two different states. Such regions are
processed combining simple statistics and expert rules. The final
method is tested on 16 proteins known to undergo structural
rearrangements and on a number of other proteins. The authors report no
false positives and identify most known structural switches.
Subsequently, the group applied [...] myosin family, identifying
putative switching regions that were not known before, but appeared to
be reasonable candidates (76). I find this method most remarkable in
two ways: (i) it is the most general method using predictions of
protein structure to predict some aspects of function, and (ii) it
illustrates that predictions may be useful even when structures are
known (as in the case of the myosin family).

FIG. 2. Prediction accuracy varies substantially between proteins.
[...] to develop any of the methods shown (58). The considerable
differences in the three-state accuracy between different proteins are
valid for all methods (A, percentage of all 150 proteins predicted at a
given level of accuracy; standard deviation is on the order of 10
percentage points). On average, different methods predict different
proteins at higher levels (B, for each protein and each method, the
difference between the per-protein average over all six methods is
shown; negative values imply that the respective method is better than
the average). Conclusions: (i) If you predict secondary structure for
your protein with a method of 76% accuracy, the actual accuracy for
that protein may be anywhere between 50 and 90%. (ii) As to be
expected: most often some methods are more accurate than the average
over many methods.
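
The first step of the switch-detection approach of Young et al.
described above, monitoring regions with equally strong preferences for
two states, can be sketched as follows. This is only an illustration of
that idea: the statistics and expert rules of the published method are
not reproduced, and the parameters max_gap and min_length are
hypothetical values chosen for the example.

```python
from typing import Sequence


def ambivalent_regions(probabilities: Sequence[Sequence[float]],
                       max_gap: float = 0.1,
                       min_length: int = 3) -> list[tuple[int, int]]:
    """Find stretches where the two best states are almost tied.

    probabilities: per residue, the outputs for helix, strand, other.
    Returns inclusive (start, end) index pairs, 0-based.
    """
    flags = []
    for probs in probabilities:
        best, second = sorted(probs, reverse=True)[:2]
        flags.append(best - second <= max_gap)

    regions = []
    start = None
    # A trailing False closes a region that runs to the end of the chain.
    for i, flag in enumerate(flags + [False]):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_length:
                regions.append((start, i - 1))
            start = None
    return regions
```

Regions returned by such a filter would be candidates only; the
reported absence of false positives rests on the additional processing
the authors describe.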



Classifying proteins based on secondary structure predictions in the
context of genome analysis. Proteins can be classified into families
based on predicted and observed secondary structure (27, 68). However,
such procedures have been limited to a very coarse-grained grouping
only exceptionally useful for inferring function (Table II).
Nevertheless, in particular, predictions of membrane helices and
coiled-coil regions are crucial for genome analysis. Recently, we came
across an observation that may have important implications for
structural genomics, in particular: More than one-fifth of all
eukaryotic proteins appeared to have regions longer than 60 residues
apparently lacking any regular secondary structure (77). Most of these
regions were not of low complexity, i.e., not composition-biased.
Surprisingly, these regions appeared evolutionarily as conserved as all
other regions in the respective proteins. Thus, the application of
secondary structure prediction may aid in classifying proteins, [...]
domains, and possibly even in identifying functional motifs.



Aspects of protein function predicted based on expert analysis of
secondary structure. The typical scenario in which secondary structure
predictions facilitate learning about function is one in which experts
combine their predictions and their intuition, most often to find
similarities to proteins of known function but insignificant sequence
similarity (39, 78-89). Usually, such applications are based on very
specific details about predicted secondary structure (some examples are
shown in Table II). Thus, these successful correlations of secondary
structure and function appear difficult to incorporate into automatic
methods.

Exploring secondary structure predictions to improve database searches.
Initially, three groups independently applied secondary structure
predictions for fold recognition, i.e., the detection of structural
similarities between proteins of unrelated sequences [...]. A few years
later, [...] every fold recognition/threading method has adopted this
concept (96-102). Two

TABLE II

Using Secondary Structure Predictions, in Practice

How to obtain the best results?
    The major source of improvement is the divergence of the multiple
    sequence alignment used for prediction. Thus, if you have a small
    family, the expected prediction accuracy is lower.
    Particularly sensitive to divergence are the reliability indices;
    i.e., less divergence yields overestimated reliability indices.
    The most successful strategy to find the most reliably predicted
    regions may be to use the reliability index provided by a method
    rather than the agreement between different methods.
    If you know there are nonglobular or structural domains in your
    protein, chop it up before you build the alignment.
    If you can improve the alignment, try to do so before the
    prediction.

Identify membrane proteins?
    Predicted membrane helices indicate that your protein is not
    globular. The accurate membrane predictions are usually more
    reliable than those for globular proteins. Thus, membrane helix
    predictions should be given preference. Globular methods often do
    not predict globular helices at positions of membrane helices;
    rather, often membrane helices are predicted as strand by
    mistakenly applied globular methods. In contrast, globular methods
    appear relatively more accurate for porin-like beta-strand membrane
    regions.
    Detection of membrane proteins has less than a 3% error rate for
    the best methods. Most helices are correctly predicted, yet the
    number of [...] may nevertheless vary. Helix caps are clearly
    predicted inaccurately. Note that general methods predicting
    three-state secondary structure for globular proteins also predict
    caps less accurately.

Classify through coiled-coil regions?
    Predictions of long coiled-coil regions clearly indicate that your
    protein is locally nonglobular. Long coiled-coil proteins are
    likely to be structural proteins. Longer regions are predicted more
    accurately.

Classify through secondary structure content?
    Classifying proteins according to the secondary structure
    composition is helpful, but arbitrary. One hope may be to infer
    from the secondary structure content that a particular [...].
    However, this attempt fails, since known protein structures vary
    significantly between 10 and ?0% of regular secondary structure
    (helix, strand). Thus, secondary structure composition does not
    help to predict globularity.

Identify domains or structural regions?
    If you see two separate secondary structure patterns, you may
    suspect that the protein has two structural domains. An extreme
    example is an N-terminal all-alpha region and a C-terminal all-beta
    region.
    If you have to cut your protein, stay more than two residues away
    from predicted helices and strands.

Monitor influences of point mutations?
    Secondary structure prediction methods are, on average, as accurate
    in predicting the overall content of secondary structure as are
    careful CD and FTIR methods. However, such methods allow you to
    monitor in detail structural responses to mutations. Such changes
    are less likely to be reflected as accurately by prediction
    methods.

Find binding sites or motifs?
    Most often, binding sites lie in nonregular secondary structure
    elements. For example, we have not predicted regular secondary
    structure for any of the known nuclear localization signals.
    Secondary structure predictions do not suffice to identify binding
    motifs, such as the zinc-finger motif. However, the combination of
    sequence motif and predicted secondary structure may be very
    helpful.

Infer functional/structural similarity?
    If you know the function/structure of protein A and wonder whether
    B shares this function/structure, a similarity in the local
    secondary structure may help you substantially.



extended the concept by not only refining the database search, but by
actually refining the quality of the alignment through an iterative
procedure (50, 103). A related strategy has been explored by Ng [...]
to improve predictions and alignments.



From 1D predictions to 2D and 3D structure. Are
secondary structure predictions accurate enough to
help predict higher order aspects of protein structure
automatically? For 2D (interresidue contacts)
predictions, Baldi et al. (105) have recently improved
the level of accuracy in predicting β-strand

FIG. 3. Prediction accuracy correlates with reliability. The conclusion from Fig. 2A is that you have a poor idea of how well a method
performs when applied to your protein of unknown structure. Fortunately, there is a way out of this dilemma: Most methods now provide
indices estimating the reliability of the prediction for each residue. Shown is the accuracy versus the cumulative percentages of residues
predicted at a given level of reliability (coverage vs accuracy). For example, PSIPRED and PROF reach a level above 88% for about 60%
of all residues (dashed line). This particular line is chosen since secondary structure assignments by DSSP agree to about 88% for proteins
of similar structure. Although JPred2 is only marginally less accurate than PSIPRED and PROF (Table I), it reaches this level of accuracy
for less than half of all residues. Conclusions: (1) Reliability indices are extremely valuable to spot regions of more-likely-to-be-correct
predictions. (2) These indices also address the problem of variation: if many residues are predicted with high reliability, your protein is
more likely to be predicted more accurately than average (Fig. 2A).



pairings over earlier work (106) by using another
elaborate neural network system. For 3D predictions,
the following list of five groups exemplifies
that secondary structure predictions are now a popular
first step toward predicting 3D structure. (i)
Ortiz et al. (107) successfully use secondary structure
predictions as one component of their 3D structure
prediction method. (ii) Eyrich et al. (108, 109)
minimize the energy of arranging predicted rigid
secondary structure segments. (iii) Lomize et al.
(110) also start from secondary structure segments.
(iv) Chen et al. (111) suggest using secondary structure
predictions to reduce the complexity of molecular
dynamics simulations. (v) Levitt and co-workers
(see 112, 113) combine secondary structure-based
simplified presentations with a particular lattice
simulation attempting to enumerate all possible
folds.

AND WHAT IS THE LIMIT OF PREDICTION
ACCURACY?

88% is a limit, but shall we ever reach close to
there? Protein secondary structure formation is
influenced by long-range interactions (45, 46, 114) and
by the environment (1, 116). Consequently,



stretches of up to 11 adjacent residues (dubbed
chameleon after (114)) can be found in different
secondary structure states (116-118). Implicitly, such
nonlocal effects are contained in the exchange patterns
of protein families. This is reflected by the fact that
strand is predicted almost as accurately as helix
(Table I), although sheets are stabilized by more
nonlocal interactions than helices. Local profiles can
even suffice to identify structural switches (1, 76).
Surprisingly, we can find some traces of folding
events in secondary structure predictions (119).
Even more amazing is a study suggesting that
alignment-based methods achieve levels of accuracy for
chameleon regions similar to those for all other
regions (118). Secondary structure assignments may
vary for two versions of the same structure. One
reason is that protein structures are not rocks but
dynamic objects with some regions being more
mobile than others. Another reason is that any
assignment method must choose particular thresholds
(e.g., DSSP chooses a cut-off in the Coulomb energy
of a hydrogen bond). Consequently, assignments
differ by about [...] percentage points between
different X-ray [...] or different [...]
same protein (Andersen and Rost, unpublished
results), and by about 12 percentage points between
structural homologues (75). The latter number
provides the upper limit for secondary structure
prediction of error-free comparative modelling. I doubt that
ab initio predictions of secondary structure will ever
become more accurate than that. Hence, I believe a
value of around 88% constitutes an operational
upper limit for prediction accuracy. After the advances
over the past 2 years we reached greater than 76%
accuracy. Thus, we need to achieve another 12
percentage points (or even less). What is the major
obstacle to reaching another 6 percentage points
higher? The size of the experimental database as
suggested (117)? I doubt this, since PHDpsi trained
on only 200 proteins using PSI-BLAST input is
almost as accurate as PSIPRED trained on 2000
proteins (Table I). Will the current explosion of
sequences boost accuracy? In fact, current databases
have fewer than 10 homologues for more than one
third of the 150 proteins tested (Table I) and more
than 100 for only 20% of the proteins. Although
based on too small a set to draw conclusions, for
these 20% highly populated families the accuracy of
PROF was 4 percentage points above average (data
not shown). Thus, larger databases may get us 6
percentage points higher, or they may not. The
answer remains nebulous.
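The percentages discussed in this section can be made concrete with a small sketch. It assumes that "accuracy" means the per-residue three-state accuracy (commonly written Q3): the percentage of residues whose predicted state matches the observed one. The function name `q3`, the state letters, and the toy strings are illustrative assumptions; only the 88% and 76% figures come from the text.

```python
def q3(predicted, observed):
    """Percentage of residues predicted in the same state as observed.

    Both arguments are equal-length strings with one state per residue
    (for example H = helix, E = strand, - = other; an assumed alphabet).
    """
    if len(predicted) != len(observed) or not observed:
        raise ValueError("need two non-empty strings of equal length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


# Toy example: four of five residues agree.
print(q3("HHHE-", "HHHEE"))  # 80.0

# Figures quoted in the text: operational upper limit and current level.
upper_limit = 88.0
current_level = 76.0
print(upper_limit - current_level)  # 12.0 percentage points still missing
```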

DISCUSSION 

Methods improved significantly over the past 2
years. Growing databases and improved search
techniques (Fig. 1), predominantly through the
iterated PSI-BLAST tool, yielded a substantial
improvement in secondary structure prediction
accuracy over the past 2 years. State-of-the-art methods
now reach sustained levels of 76% prediction
accuracy (Table I). Even more impressively, about 60% of
all residues are predicted at levels reaching the level
of agreement between X-ray and NMR structures
(Fig. 3). However, novel ideas have also been shown
to improve prediction accuracy. A standard way to
increase the confidence in a particular prediction is
to look at the results from many different prediction
methods. This strategy is frequently successful and
has been brought to perfection over recent years.
However, often the best method is better than the
average over many methods (Fig. 2B). While structure
prediction is coming of age, developers and users
slowly learn to reduce overestimations. However,
the correlations between proteins at times of
database explosions are becoming more difficult to
control. It seems that only continuous, automatic
evaluation servers will be able to handle this
challenge in the future (68, [...]).
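The statement that about 60% of all residues are predicted at a higher level of accuracy rests on per-residue reliability indices. The sketch below shows one way such an accuracy-versus-coverage curve could be tabulated. It is an assumption-laden illustration: the function name `accuracy_vs_coverage`, the integer reliability scale, and the toy data are invented for this example and do not reproduce any particular method's output.

```python
def accuracy_vs_coverage(predicted, observed, reliability):
    """Tabulate accuracy for residues at or above each reliability level.

    Returns a list of (threshold, coverage %, accuracy %) tuples, from the
    strictest threshold to the most permissive one. Coverage is the
    percentage of residues kept; accuracy is computed on those only.
    """
    n = len(predicted)
    if n == 0 or len(observed) != n or len(reliability) != n:
        raise ValueError("need three non-empty sequences of equal length")
    rows = []
    for threshold in sorted(set(reliability), reverse=True):
        kept = [i for i in range(n) if reliability[i] >= threshold]
        correct = sum(predicted[i] == observed[i] for i in kept)
        rows.append((threshold,
                     100.0 * len(kept) / n,
                     100.0 * correct / len(kept)))
    return rows


# Toy data: the two wrong residues carry low reliability values.
predicted = "HHHHEEEE--"
observed = "HHHEEEEE-H"
reliability = [9, 9, 8, 3, 2, 8, 9, 9, 7, 1]
for row in accuracy_vs_coverage(predicted, observed, reliability):
    print(row)
```

In this toy case the most reliable 70% of residues are all correct, while accuracy over all residues is 80%; real curves are of course much smoother.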



Secondary structure predictions are at the base of
structure-based sequence analysis. Almost a
decade after the original breakthrough, prediction
methods are now increasingly explored by wet-lab
biologists to analyze their protein of interest.
Secondary structure predictions are used automatically
by methods aiming at higher dimensional aspects of
protein structure and at improving database
searches and alignment accuracy. One method has
successfully related secondary structure predictions
automatically to functional aspects (1, 76). However,
secondary structure-based identifications of binding
sites or other functional aspects are still restricted to
single-case expert analyses.

And now we run human? The field has advanced
considerably over the past 2 years, and more
improvement appears to lie ahead. Prediction methods
are fast enough to analyze entire genomes, and for
particular examples the resulting classifications are
relevant to structural and functional genomics (28,
68). Nevertheless, to play the devil's advocate: The
field is not up to the challenge of the human
sequences to be dumped into the database very soon.
We are lacking a variety of approaches relating
secondary structure predictions explicitly to
function, such as given by ASP (1). Obviously, this
remark may apply to bioinformatics, in general: The
year 2001 will commence with the publication of the
entire human genome; we must rush to get ready for
the data flood.

Thanks are extended to Jinfeng Liu (Columbia University) for
computer assistance and the collection of genome data sets; to
Jinfeng Liu and Dariusz Przybylski (Columbia University) for
providing preliminary information and programs; and to Claus
Andersen and Søren Brunak (CBS Copenhagen) for [...]
comments on the manuscript. Particular thanks are due to Volker
Eyrich (Columbia University) for programming and maintaining
most of the immensely valuable software that runs the EVA and
META-PredictProtein servers!